My brother and a couple of his friends write movie reviews on the site Criticker.com. I thought it would be fun to give them a text analysis of their movie reviews. What follows is that analysis.
The workflow for this project starts with an exploration of the data, moves through word and n-gram analysis, and finishes with sentiment analysis and classification modeling.
The code for this analysis can be found on my Github page.
You can also find the three reviewers' pages here:
The dataset is simple. It consists of an XML export for each user, and after cleaning there are only two columns:
quote: the movie review
reviewer: the first name of the reviewer
The following steps were taken to preprocess the data.
First, I only cared about movies that actually had a review, so any movies that had only a score were removed.
Second, the movie reviews were tokenized. Tokenization means breaking a review into one word per row, which makes the dataset significantly longer.
Finally, a significant number of words in the English language are not useful for sentiment analysis or prediction, for example ‘and’ and ‘the’. These words are known as stop words and have been removed from the dataset.
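The three preprocessing steps above can be sketched roughly as follows. This is an illustrative sketch, not the code from the repo: the column names, the sample rows, and the tiny stop-word list are all placeholders.

```python
# Minimal sketch of the preprocessing: drop score-only rows,
# tokenize reviews into one word per row, remove stop words.
STOP_WORDS = {"and", "the", "a", "of", "to", "is", "it"}  # tiny illustrative list

reviews = [
    {"quote": "The acting is great and the plot is tense", "reviewer": "Justin"},
    {"quote": "", "reviewer": "Tyler"},  # score-only entry: no review text
]

# 1) Keep only entries that actually have review text.
with_text = [r for r in reviews if r["quote"].strip()]

# 2) Tokenize: one lowercase word per row.
tokens = [
    {"word": w.lower(), "reviewer": r["reviewer"]}
    for r in with_text
    for w in r["quote"].split()
]

# 3) Remove stop words.
tokens = [t for t in tokens if t["word"] not in STOP_WORDS]

print([t["word"] for t in tokens])  # → ['acting', 'great', 'plot', 'tense']
```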
A good place to start with any analysis is with an exploration of the data.
First, a quick look at how many reviews have been created and how many total words have been written for each reviewer.
Summary Statistics for Each Reviewer

| Reviewer | Review Count | Total Word Count | Word Count excl. Stop Words | Avg Word Count per Review | Shortest Review | Longest Review | % of Stop Words |
|---|---|---|---|---|---|---|---|
| Justin | 1,261 | 85,100 | 29,576 | 68 | 1 | 106 | 65.0% |
| Tyler | 2,092 | 140,954 | 43,729 | 67 | 1 | 107 | 69.0% |
| Zach | 604 | 37,305 | 16,022 | 62 | 1 | 100 | 57.0% |
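The table's columns boil down to simple counts over the reviews. A hypothetical sketch of the computation (the sample reviews and stop-word list are illustrative, not real data):

```python
from collections import defaultdict

STOP_WORDS = {"and", "the", "is"}  # illustrative subset

reviews = [
    ("Justin", "great movie and the acting is solid"),
    ("Justin", "bad"),
    ("Zach", "the film is a quiet character study"),
]

stats = defaultdict(lambda: {"reviews": 0, "words": 0, "non_stop": 0, "lengths": []})
for reviewer, quote in reviews:
    words = quote.lower().split()
    s = stats[reviewer]
    s["reviews"] += 1
    s["words"] += len(words)
    s["non_stop"] += sum(w not in STOP_WORDS for w in words)
    s["lengths"].append(len(words))

for reviewer, s in stats.items():
    print(reviewer,
          s["reviews"],                              # review count
          s["words"],                                # total word count
          s["non_stop"],                             # word count excl. stop words
          round(s["words"] / s["reviews"]),          # avg word count per review
          min(s["lengths"]), max(s["lengths"]),      # shortest / longest review
          f"{1 - s['non_stop'] / s['words']:.0%}")   # % of stop words
```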
Some observations from the summary statistics: Zach has the fewest total reviews while Tyler has the most. All three reviewers have roughly the same average word count per review. Each has a one-word review, and their longest reviews are all around the same word count. Finally, Zach has the lowest percentage of stop words among the three, leading me to believe his reviews may be more concise than the other two reviewers'.
Second, I will take a look at the most common words from all the reviewers combined.
Looking at the image above, the top word used across all three reviewers is ‘movie’, which is not a big surprise. Furthermore, none of the words on the list look out of place for movie reviewers.
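Counting the top words is just a frequency tally over the tokenized, stop-word-free data. In Python this amounts to a `collections.Counter` (a sketch with made-up tokens, not the repo's code):

```python
from collections import Counter

# One word per row after tokenization and stop-word removal (illustrative data).
words = ["movie", "film", "movie", "plot", "movie", "film", "acting"]

top = Counter(words).most_common(3)
print(top)  # → [('movie', 3), ('film', 2), ('plot', 1)]
```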
Now that we know the top ten words overall, let's look at the top ten words for each reviewer.
Generally speaking, Tyler and Justin use pretty similar words. What is interesting to me is that Zach actually uses ‘film’ more often than ‘movie’, whereas ‘film’ is less popular with Tyler and Justin. Zach also likes to talk about scenes more often than either Tyler or Justin.
Previously we only looked at the top 10 words for each reviewer. Below you will find a visual representation (wordcloud) of all words for each reviewer. To read the wordcloud, the largest text is the word appearing most frequently in the reviews. Below are the wordclouds for Justin, Tyler and Zach.
After looking at the top ten words and the wordclouds, we can visually compare the word frequencies of one reviewer against another.
Above, the frequency of each word is plotted for one reviewer against another. Words along the dotted line have similar frequencies for the two reviewers: in the left panel, both Tyler and Justin use “movies”, “bad”, and “character” with similar frequencies. Words farther from the line appear more often in one reviewer’s text than the other’s. For example, in the right panel, “disney” and “viewer” appear more frequently in Zach’s reviews than in Tyler’s, while “pretty” and “nice” appear more often in Tyler’s.
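The quantity behind those plots is each word's relative frequency per reviewer. A small sketch of how those could be computed and compared (the word lists are illustrative):

```python
from collections import Counter

justin = "movie bad character movie plot".split()
tyler = "movie bad character pretty nice".split()

def relative_freqs(words):
    """Share of a reviewer's total word count taken by each word."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

fj, ft = relative_freqs(justin), relative_freqs(tyler)

# Words both reviewers use, ordered by how close their relative
# frequencies are; the last entries sit farthest from the diagonal.
shared = sorted(sorted(set(fj) & set(ft)),
                key=lambda w: abs(fj[w] - ft[w]))
print(shared)  # → ['bad', 'character', 'movie']
```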
After looking at the counts and frequencies of single words, it is time to look at the relationships between words. In the first part of this section we will be looking at n-grams. N-grams are sets of adjacent words: for example, bigrams are two adjacent words in the text and trigrams are three adjacent words.
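Generating n-grams amounts to pairing each token with its neighbors. A minimal sketch (note that in the real analysis pairs would be formed within a single review, never across review boundaries):

```python
def ngrams(words, n):
    """Return the list of n-tuples of adjacent words, in order."""
    return list(zip(*(words[i:] for i in range(n))))

words = "really like sci fi movies".split()
print(ngrams(words, 2))  # bigrams, e.g. ('sci', 'fi')
print(ngrams(words, 3))  # trigrams
```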
Below are the top bigrams for each reviewer.
Now it is starting to get interesting. I would guess that all three reviewers prefer sci-fi movies, because ‘sci-fi’ is the most popular bigram in each of their datasets. Zach appears to really like the actor Robert De Niro, who shows up in his top ten bigrams. Tyler and Justin also appear to watch quite a few horror movies, though Tyler appears to watch more action movies than horror movies. Zach discusses more of the meta parts of films, including special effects, voice acting, and physical comedy, and also has a place in his heart for movies with love stories.
As with individual words, we also want to take a look at not just the top ten bigrams, but also the relationship between all bigrams. We will do this using a network graph. A network graph shows the connections between nodes, which are words in this case. Network graphs are really popular in social media data analytics.
We will look at each reviewer separately, starting with Justin. Note that the shapes of the graphs do not matter: the layouts are randomly generated each time the code is run, so no meaning should be derived from them.
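Under the hood, a bigram network is just words as nodes and adjacent-word pairs as weighted edges. A plain-Python sketch of the underlying structure, with made-up bigrams (the write-up's actual graphs may be built with dedicated graph tooling):

```python
from collections import Counter, defaultdict

bigrams = [("sci", "fi"), ("sci", "fi"), ("worth", "watching"), ("tom", "cruise")]

# Edge weights: how often each adjacent word pair occurs.
weights = Counter(bigrams)

# Adjacency list: each word (node) maps to the words it appears next to.
adjacency = defaultdict(set)
for a, b in bigrams:
    adjacency[a].add(b)
    adjacency[b].add(a)

print(weights[("sci", "fi")])    # → 2
print(sorted(adjacency["sci"]))  # → ['fi']
```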
A quick glance at the network graph shows some expected connections, for example around ‘movie’, which is connected to typical movie-related words like ‘kids’, ‘horror’, and ‘comic’. He also appears to like phrases such as ‘totally worth watching’ or ‘totally worth checking’. It also appears that he likes to talk about actors, since their names appear frequently: ‘Tom Cruise’, ‘Liam Neeson’, and ‘Martin Lawrence’. I’m still trying to figure out why he talks about ‘pro wrestling’ so much.
Below is the network graph of Tyler’s bigrams. Like Justin, Tyler also likes to mention actors’ names, like ‘Tom Cruise’. The most interesting part (to me) is the cluster around ‘movie(s)’. Tyler uses the word ‘movie’ quite extensively, not just to talk about genre but also to describe his feelings about a movie: ‘forgettable’, ‘pretty’, and ‘funniest’.
Finally, we move on to Zach’s bigrams. The first thing you probably noticed is that Zach’s network graph is quite a bit sparser than the others. That is because he has the lowest review count, and so a smaller number of adjacent word pairs. Zach’s reviews appear to frequently discuss the characters: whether they were main or supporting, and whether they were memorable. He also apparently likes true love stories.
Now that we have taken a look at the individual and adjacent words, it is time to look at the sentiment of the movie reviews. We are not going to look at the sentiment of individual words, because that approach is a bit too primitive, and the English language is syntactically complex and lexically rich.
The algorithm I will use is a bit better than word-by-word sentiment. It uses “valence shifters” to adjust the sentiment score. For example, if you do sentiment analysis on the single word “happy”, the score is positive. Obviously, though, the phrase “not happy” is no longer positive, but a single-word sentiment analysis would not pick that up. Valence shifters adjust “not happy” to negative, or at least less positive. Also, the sentiment score returned will be an aggregate over the sentences of each review, so that we end up with a single score per reviewer and movie. A score > 0 implies an overall positive sentiment.
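To make the valence-shifter idea concrete, here is a deliberately tiny toy scorer. The lexicon, negator list, and flip rule are all illustrative and far simpler than what a real sentiment library does:

```python
# Toy valence-shifted sentiment: score each word from a lexicon,
# but flip the score when a negator immediately precedes it.
LEXICON = {"happy": 1.0, "great": 1.0, "bad": -1.0, "boring": -1.0}
NEGATORS = {"not", "never", "no"}

def sentiment(text):
    words = text.lower().split()
    score = 0.0
    for i, w in enumerate(words):
        s = LEXICON.get(w, 0.0)
        if i > 0 and words[i - 1] in NEGATORS:
            s = -s  # valence shifter: a preceding negator flips polarity
        score += s
    return score

print(sentiment("happy"))      # → 1.0
print(sentiment("not happy"))  # → -1.0
print(sentiment("not bad"))    # → 1.0
```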
First, here are the distributions of the scores by reviewer.
Justin’s sentiment distribution clusters close to 0, meaning his reviews are a bit more neutral. Tyler and Zach have heavier right tails, with both distributions looking fairly close to normal.
All three skew to the right, and Tyler and Zach seem to have a greater tendency toward higher average positive scores for movies.
Now, let’s take a look at how well the average sentiment scores relate to the actual ratings.
Reviewing the three scatterplots above, there definitely appears to be a positive relationship between rating and average sentiment. However, there also seems to be quite a high level of variability in the average sentiment at each rating.
Finally, it is time to do a little classification modeling. The end goal here will be to see if we can correctly classify the reviews into the categories for each reviewer.
This is how each reviewer categorizes their review scores (Justin calls his lowest bucket “Horrific”; Tyler and Zach call theirs “Horrible”):

| Reviewer | Horrific/Horrible | Bad | Average | Good | Great | Exceptional |
|---|---|---|---|---|---|---|
| Justin | 0 - 15 | 16 - 34 | 35 - 69 | 70 - 79 | 80 - 89 | 90 - 100 |
| Tyler | 0 - 24 | 25 - 49 | 50 - 69 | 70 - 79 | 80 - 89 | 90 - 100 |
| Zach | 0 - 24 | 25 - 49 | 50 - 69 | 70 - 79 | 80 - 89 | 90 - 100 |
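Turning a 0-100 score into a reviewer's category label is a simple lookup against those cut-offs. A sketch (function and variable names are my own, not from the repo):

```python
# Each reviewer's buckets as (upper_bound, label) pairs, from the cut-offs above.
BUCKETS = {
    "Justin": [(15, "Horrific"), (34, "Bad"), (69, "Average"),
               (79, "Good"), (89, "Great"), (100, "Exceptional")],
    "Tyler": [(24, "Horrible"), (49, "Bad"), (69, "Average"),
              (79, "Good"), (89, "Great"), (100, "Exceptional")],
}
BUCKETS["Zach"] = BUCKETS["Tyler"]  # Zach and Tyler share the same cut-offs

def categorize(reviewer, score):
    """Map a 0-100 score to that reviewer's category label."""
    for upper, label in BUCKETS[reviewer]:
        if score <= upper:
            return label
    raise ValueError("score out of range")

print(categorize("Justin", 20))  # → Bad (Justin's Bad bucket starts at 16)
print(categorize("Tyler", 20))   # → Horrible (Tyler's runs to 24)
print(categorize("Zach", 95))    # → Exceptional
```

Note that the same numeric score can land in different categories depending on the reviewer, which is why the classification target is per-reviewer.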